{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 模型融合方法概述\n",
    " \n",
    "在比赛中提高成绩主要有3个地方\n",
    "\n",
    "1. 特征工程\n",
    "2. 调参\n",
    "3. 模型融合\n",
    "\n",
    "## 1. Voting （投票）\n",
    "模型融合其实也没有想象的那么高大上，从最简单的Voting说起，这也可以说是一种模型融合。假设对于一个二分类问题，有3个基础模型，那么就采取投票制的方法，投票多者确定为最终的分类。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "voting_clf = VotingClassifier(estimators=[\n",
    "            (\"lgb_clf\",lgb.LGBMClassifier(learning_rate=0.2129531328773979,\n",
    "                                         min_child_samples=7, #它的值取决于训练数据的样本个树和num_leaves. 将其设置的较大可以避免生成一个过深的树, 但有可能导致欠拟合。\n",
    "                                         max_depth=7, #设置树深度，深度越大可能过拟合\n",
    "                                         lambda_l1=2,boosting=\"gbdt\",objective=\"multiclass\",\n",
    "                                         n_estimators=2000,metric='multi_error',num_class=3,\n",
    "                                         feature_fraction=0.75,bagging_fraction=0.85,seed=99,\n",
    "                                         num_threads=20,verbose=-1,n_jobs=-1,device=\"cpu\")),\n",
    "#             (\"gdbt_clf\",GradientBoostingClassifier(n_estimators=1200, max_depth=9, \n",
    "#                                                         min_samples_split=6, learning_rate=0.22461383601711735)),\n",
    "            (\"xgb_clf\",xgb.XGBClassifier(max_depth=4,learning_rate=0.39607418419884505,n_estimators=200,\n",
    "                                  silent=True,objective='multi:softmax')),\n",
    "            (\"cat_clf\",CatBoostClassifier(iterations=1700, depth=5,learning_rate=0.40809075477900003)),\n",
    "#             (\"adaboost_clf\",AdaBoostClassifier(DecisionTreeClassifier(max_depth=9, min_samples_split=20, min_samples_leaf=5),\n",
    "#                          algorithm=\"SAMME\",\n",
    "#                          n_estimators=1000, learning_rate=0.7))\n",
    "        ],voting=\"soft\")"
   ]
  },
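  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ensemble above is built from three boosting libraries and is not runnable on its own without data. As a minimal, self-contained sketch of the same voting idea, using plain sklearn classifiers on the iris data (the model choices here are illustrative, not the author's setup):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Minimal voting sketch: three different sklearn models, majority vote on iris.\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.ensemble import RandomForestClassifier, VotingClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "\n",
    "X, y = load_iris(return_X_y=True)\n",
    "X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)\n",
    "\n",
    "voting = VotingClassifier(estimators=[\n",
    "        (\"lr\", LogisticRegression(max_iter=1000)),\n",
    "        (\"rf\", RandomForestClassifier(n_estimators=100, random_state=1)),\n",
    "        (\"nb\", GaussianNB()),\n",
    "    ], voting=\"hard\")  # \"hard\": majority vote on labels; \"soft\": average probabilities\n",
    "voting.fit(X_tr, y_tr)\n",
    "acc = voting.score(X_te, y_te)\n",
    "print(\"voting accuracy: {:.3f}\".format(acc))"
   ]
  },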
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.Averaging（平均）\n",
    "对于回归问题，一个简单直接的思路是取平均。稍稍改进的方法是进行加权平均。权值可以用排序的方法确定，举个例子，比如A、B、C三种基本模型，模型效果进行排名，假设排名分别是1，2，3，那么给这三个模型赋予的权值分别是3/6、2/6、1/6\n",
    "这两种方法看似简单，其实后面的高级算法也可以说是基于此而产生的，Bagging或者Boosting都是一种把许多弱分类器这样融合成强分类器的思想。\n",
    "\n",
    "平均法一般是针对回归问题的。字面意思理解就是对结果求平均值。代码：kaggle_avg.py\n",
    "但是直接对结果求平均值会有度量上的问题，不同的方法预测的结果，融合的时候波动较小的起的作用就比较小，为了解决这个问题，提出了Rank Averaging，先将回归的值进行排序，在利用均匀分布打分。代码：kaggle_rankavg.py\n",
    "一个小栗子：\n",
    "\n",
    "*** 模型1 ***\n",
    "> Id,Prediction\n",
    "1,0.35000056\n",
    "2,0.35000002\n",
    "3,0.35000098\n",
    "4,0.35000111\n",
    "\n",
    "*** 模型2 ***\n",
    "> Id,Prediction\n",
    "1,0.57\n",
    "2,0.04\n",
    "3,0.99\n",
    "4,0.96\n",
    "### 先将结果排序：\n",
    "*** 模型1：***\n",
    "> Id,Rank,Prediction\n",
    "1,1,0.35000056\n",
    "2,0,0.35000002\n",
    "3,2,0.35000098\n",
    "4,3,0.35000111\n",
    "\n",
    "***模型2：***\n",
    "> Id,Rank,Prediction\n",
    "1,1,0.57\n",
    "2,0,0.04\n",
    "3,3,0.99\n",
    "4,2,0.96\n",
    "### 对排序结果进行归一化：\n",
    "\n",
    "***模型1：***\n",
    "> Id,Prediction\n",
    "1,0.33\n",
    "2,0.0\n",
    "3,0.66\n",
    "4,1.0\n",
    "\n",
    "***模型2：***\n",
    "> Id,Prediction\n",
    "1,0.33\n",
    "2,0.0\n",
    "3,1.0\n",
    "4,0.66\n",
    "\n",
    "再对归一化后的排序结果融合打分：\n",
    "\n",
    "> Id,Prediction\n",
    "1,0.33\n",
    "2,0.0\n",
    "3,0.83\n",
    "4,0.83"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#kaggle_avg.py\n",
    "from collections import defaultdict\n",
    "from glob import glob\n",
    "import sys\n",
    "\n",
    "glob_files = sys.argv[1]\n",
    "loc_outfile = sys.argv[2]\n",
    "\n",
    "def kaggle_bag(glob_files, loc_outfile, method=\"average\", weights=\"uniform\"):\n",
    "  if method == \"average\":\n",
    "    scores = defaultdict(float)\n",
    "  with open(loc_outfile,\"w\") as outfile:\n",
    "    for i, glob_file in enumerate( glob(glob_files) ):\n",
    "      print(\"parsing: {}\".format(glob_file))\n",
    "      # sort glob_file by first column, ignoring the first line\n",
    "      lines = open(glob_file).readlines()\n",
    "      lines = [lines[0]] + sorted(lines[1:])\n",
    "      for e, line in enumerate( lines ):\n",
    "        if i == 0 and e == 0:\n",
    "          outfile.write(line)\n",
    "        if e > 0:\n",
    "          row = line.strip().split(\",\")\n",
    "          scores[(e,row[0])] += float(row[1])\n",
    "    for j,k in sorted(scores):\n",
    "      outfile.write(\"%s,%f\\n\"%(k,scores[(j,k)]/(i+1)))\n",
    "    print(\"wrote to {}\".format(loc_outfile))\n",
    "\n",
    "kaggle_bag(glob_files, loc_outfile)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#kaggle_rankavg.py\n",
    "from __future__ import division\n",
    "from collections import defaultdict\n",
    "from glob import glob\n",
    "import sys\n",
    "\n",
    "glob_files = sys.argv[1]\n",
    "loc_outfile = sys.argv[2]\n",
    "\n",
    "def kaggle_bag(glob_files, loc_outfile):\n",
    "  with open(loc_outfile,\"w\") as outfile:\n",
    "    all_ranks = defaultdict(list)\n",
    "    for i, glob_file in enumerate( glob(glob_files) ):\n",
    "      file_ranks = []\n",
    "      print(\"parsing: {}\".format(glob_file))\n",
    "      # sort glob_file by first column, ignoring the first line\n",
    "      lines = open(glob_file).readlines()\n",
    "      lines = [lines[0]] + sorted(lines[1:])\n",
    "      for e, line in enumerate( lines ):\n",
    "        if e == 0 and i == 0:\n",
    "          outfile.write( line )\n",
    "        elif e > 0:\n",
    "          r = line.strip().split(\",\")\n",
    "          file_ranks.append( (float(r[1]), e, r[0]) )\n",
    "      for rank, item in enumerate( sorted(file_ranks) ):\n",
    "        all_ranks[(item[1],item[2])].append(rank)\n",
    "    average_ranks = []\n",
    "    for k in sorted(all_ranks):\n",
    "      average_ranks.append((sum(all_ranks[k])/len(all_ranks[k]),k))\n",
    "    ranked_ranks = []\n",
    "    for rank, k in enumerate(sorted(average_ranks)):\n",
    "      ranked_ranks.append((k[1][0],k[1][1],rank/(len(average_ranks)-1)))\n",
    "    for k in sorted(ranked_ranks):\n",
    "      outfile.write(\"%s,%s\\n\"%(k[1],k[2]))\n",
    "    print(\"wrote to {}\".format(loc_outfile))\n",
    "\n",
    "kaggle_bag(glob_files, loc_outfile)"
   ]
  },
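  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The toy rank-averaging example above can be checked with a few lines of numpy (a small sketch; the numbers are the ones from the example):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rank-average the two toy submissions from the example above.\n",
    "import numpy as np\n",
    "\n",
    "model1 = np.array([0.35000056, 0.35000002, 0.35000098, 0.35000111])\n",
    "model2 = np.array([0.57, 0.04, 0.99, 0.96])\n",
    "\n",
    "def normalized_ranks(preds):\n",
    "    # rank 0..n-1 by predicted value, then rescale to [0, 1]\n",
    "    order = preds.argsort()\n",
    "    ranks = np.empty_like(order)\n",
    "    ranks[order] = np.arange(len(preds))\n",
    "    return ranks / (len(preds) - 1)\n",
    "\n",
    "blend = (normalized_ranks(model1) + normalized_ranks(model2)) / 2\n",
    "print(np.round(blend, 2))  # ids 3 and 4 tie at 0.83, as in the text"
   ]
  },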
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Bagging\n",
    "Bagging就是采用有放回的方式进行抽样，用抽样的样本建立子模型,对子模型进行训练，这个过程重复多次，最后进行融合。大概分为这样两步：\n",
    "1. 重复K次    \n",
    "有放回地重复抽样建模    \n",
    "训练子模型\n",
    "\n",
    "2. 模型融合    \n",
    "分类问题：voting    \n",
    "回归问题：average\n",
    "\n",
    "Bagging算法不用我们自己实现，随机森林就是基于Bagging算法的一个典型例子，采用的基分类器是决策树。R和python都集成好了，直接调用。\n",
    "\n",
    "## 4. Boosting\n",
    "\n",
    "Bagging算法可以并行处理，而Boosting的思想是一种迭代的方法，每一次训练的时候都更加关心分类错误的样例，给这些分类错误的样例增加更大的权重，下一次迭代的目标就是能够更容易辨别出上一轮分类错误的样例。最终将这些弱分类器进行加权相加。引用加州大学欧文分校Alex Ihler教授的两页PPT\n",
    "\n",
    "![](http://aliyuntianchipublic.cn-hangzhou.oss-pub.aliyun-inc.com/public/files/image/null/1538991385596_8amDUFDBhv.jpg)\n",
    "\n",
    "![](http://aliyuntianchipublic.cn-hangzhou.oss-pub.aliyun-inc.com/public/files/image/null/1538991393147_T03MOs6DoK.jpg)\n",
    "\n",
    "同样地，基于Boosting思想的有AdaBoost、GBDT等，在R和python也都是集成好了直接调用。\n",
    "\n",
    "PS：理解了这两点，面试的时候关于Bagging、Boosting的区别就可以说上来一些，问Randomfroest和AdaBoost的区别也可以从这方面入手回答。也算是留一个小问题，随机森林、Adaboost、GBDT、XGBoost的区别是什么？\n"
   ]
  },
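  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both families are indeed available off the shelf. A minimal sklearn sketch comparing a single decision tree, a bagged ensemble of trees, and AdaBoost (the synthetic dataset and hyperparameters here are illustrative):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Bagging vs. boosting with sklearn's ready-made ensembles on synthetic data.\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "X, y = make_classification(n_samples=500, n_features=20, random_state=0)\n",
    "\n",
    "models = {\n",
    "    \"single tree\": DecisionTreeClassifier(random_state=0),\n",
    "    # bagging: independent bootstrap samples, can train in parallel\n",
    "    \"bagging\": BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=0),\n",
    "    # boosting: sequential, reweights misclassified samples each round\n",
    "    \"adaboost\": AdaBoostClassifier(n_estimators=50, random_state=0),\n",
    "}\n",
    "scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}\n",
    "for name, s in scores.items():\n",
    "    print(\"{:<12s} {:.3f}\".format(name, s))"
   ]
  },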
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Stacking\n",
    "\n",
    "Stacking方法其实弄懂之后应该是比Boosting要简单的，毕竟小几十行代码可以写出一个Stacking算法。我先从一种“错误”但是容易懂的Stacking方法讲起。\n",
    "\n",
    "Stacking模型本质上是一种分层的结构，这里简单起见，只分析二级Stacking.假设我们有3个基模型M1、M2、M3。\n",
    "\n",
    "1. 基模型M1，对训练集train训练，然后用于预测train和test的标签列，分别是P1，T1\n",
    "\n",
    "![](http://aliyuntianchipublic.cn-hangzhou.oss-pub.aliyun-inc.com/public/files/image/null/1538991773172_TjSh5q90VO.jpg)\n",
    "\n",
    "对于M2和M3，重复相同的工作，这样也得到P2,T2,P3,T3。\n",
    "\n",
    "2. 分别把P1,P2,P3以及T1,T2,T3合并，得到一个新的训练集和测试集train2,test2.\n",
    "\n",
    "![](http://aliyuntianchipublic.cn-hangzhou.oss-pub.aliyun-inc.com/public/files/image/null/1538991843005_4MlYALj032.jpg)\n",
    "\n",
    "3. 再用第二层的模型M4训练train2,预测test2,得到最终的标签列。\n",
    "\n",
    "![](http://aliyuntianchipublic.cn-hangzhou.oss-pub.aliyun-inc.com/public/files/image/null/1538991912508_JG3Qz1YEom.jpg)\n",
    "\n",
    "Stacking本质上就是这么直接的思路，但是这样肯定是不行的，问题在于P1的得到是有问题的，用整个训练集训练的模型反过来去预测训练集的标签，毫无疑问过拟合是非常非常严重的，因此现在的问题变成了如何在解决过拟合的前提下得到P1、P2、P3，这就变成了熟悉的节奏——K折交叉验证。我们以2折交叉验证得到P1为例,假设训练集为4行3列\n",
    "\n",
    "![](http://aliyuntianchipublic.cn-hangzhou.oss-pub.aliyun-inc.com/public/files/image/null/1538992078446_JHfpRWc5XW.jpg)\n",
    "\n",
    "将其划分为2部分\n",
    "\n",
    "![](http://aliyuntianchipublic.cn-hangzhou.oss-pub.aliyun-inc.com/public/files/image/null/1538992139082_JXDqwmcBli.jpg)\n",
    "\n",
    "用traina训练模型M1，然后在trainb上进行预测得到preb3和pred4\n",
    "\n",
    "![](http://aliyuntianchipublic.cn-hangzhou.oss-pub.aliyun-inc.com/public/files/image/null/1538992097358_zzDLHEVZ4F.jpg)\n",
    "\n",
    "在trainb上训练模型M1，然后在traina上进行预测得到pred1和pred2\n",
    "\n",
    "![](http://aliyuntianchipublic.cn-hangzhou.oss-pub.aliyun-inc.com/public/files/image/null/1538992194352_RbEV6XjRKl.jpg)\n",
    "\n",
    "然后把两个预测集进行拼接\n",
    "\n",
    "![](http://aliyuntianchipublic.cn-hangzhou.oss-pub.aliyun-inc.com/public/files/image/null/1538992221909_tt0tFa5efL.jpg)\n",
    "\n",
    "对于测试集T1的得到，有两种方法。注意到刚刚是2折交叉验证，M1相当于训练了2次，所以一种方法是每一次训练M1，可以直接对整个test进行预测，这样2折交叉验证后测试集相当于预测了2次，然后对这两列求平均得到T1。\n",
    "或者直接对测试集只用M1预测一次直接得到T1。\n",
    "P1、T1得到之后，P2、T2、P3、T3也就是同样的方法。理解了2折交叉验证，对于K折的情况也就理解也就非常顺利了。所以最终的代码是两层循环，第一层循环控制基模型的数目，每一个基模型要这样去得到P1，T1，第二层循环控制的是交叉验证的次数K，对每一个基模型，会训练K次最后拼接得到P1，取平均得到T1。\n",
    "\n",
    "\n"
   ]
  },
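  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two-loop procedure just described can be sketched in a few dozen lines. This is a minimal illustration on iris with two arbitrary base models (the model choices, fold count and dataset are illustrative), using the \"average the K test predictions\" variant for T:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# From-scratch stacking: outer loop over base models, inner loop over the K folds.\n",
    "import numpy as np\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import KFold, train_test_split\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "X, y = load_iris(return_X_y=True)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n",
    "\n",
    "base_models = [KNeighborsClassifier(n_neighbors=3), RandomForestClassifier(random_state=0)]\n",
    "kf = KFold(n_splits=5, shuffle=True, random_state=0)\n",
    "\n",
    "P = np.zeros((len(X_train), len(base_models)))  # out-of-fold predictions on train\n",
    "T = np.zeros((len(X_test), len(base_models)))   # fold-averaged predictions on test\n",
    "\n",
    "for j, model in enumerate(base_models):                        # loop 1: base models\n",
    "    test_preds = np.zeros((len(X_test), kf.get_n_splits()))\n",
    "    for k, (tr_idx, val_idx) in enumerate(kf.split(X_train)):  # loop 2: K folds\n",
    "        model.fit(X_train[tr_idx], y_train[tr_idx])\n",
    "        P[val_idx, j] = model.predict(X_train[val_idx])        # concatenate into P's column\n",
    "        test_preds[:, k] = model.predict(X_test)               # predict test at every fold\n",
    "    T[:, j] = test_preds.mean(axis=1)                          # average the K test columns\n",
    "\n",
    "meta = LogisticRegression(max_iter=1000).fit(P, y_train)       # second-level model M4\n",
    "acc = meta.score(T, y_test)\n",
    "print(\"stacking accuracy: {:.3f}\".format(acc))"
   ]
  },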
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](http://aliyuntianchipublic.cn-hangzhou.oss-pub.aliyun-inc.com/public/files/image/null/1538992306573_SR6tNscSGd.jpg)\n",
    "\n",
    "该图是一个基模型得到P1和T1的过程，采用的是5折交叉验证，所以循环了5次，拼接得到P1，测试集预测了5次，取平均得到T1。而这仅仅只是第二层输入的一列/一个特征，并不是整个训练集。再分析作者的代码也就很清楚了。也就是刚刚提到的两层循环。\n",
    "\n",
    "## 5.1 stacking方法\n",
    "\n",
    "将训练好的所有基模型对整个训练集进行预测，第j个基模型对第i个训练样本的预测值将作为新的训练集中第i个样本的第j个特征值，最后基于新的训练集进行训练。同理，预测的过程也要先经过所有基模型的预测形成新的测试集，最后再对测试集进行预测：\n",
    "\n",
    "![](http://aliyuntianchipublic.cn-hangzhou.oss-pub.aliyun-inc.com/public/files/image/null/1538993852611_7Ac5Tjvsn1.jpg)\n",
    "\n",
    "下面我们介绍一款功能强大的stacking利器，mlxtend库，它可以很快地完成对sklearn模型地stacking。\n",
    "\n",
    "主要有以下几种使用方法吧：\n",
    "\n",
    "I. 最基本的使用方法，即使用前面分类器产生的特征输出作为最后总的meta-classifier的输入数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    " \n",
    "iris = datasets.load_iris()\n",
    "X, y = iris.data[:, 1:3], iris.target\n",
    " \n",
    "from sklearn import model_selection\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.naive_bayes import GaussianNB \n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from mlxtend.classifier import StackingClassifier\n",
    "import numpy as np\n",
    " \n",
    "clf1 = KNeighborsClassifier(n_neighbors=1)\n",
    "clf2 = RandomForestClassifier(random_state=1)\n",
    "clf3 = GaussianNB()\n",
    "lr = LogisticRegression()\n",
    "sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], \n",
    "                          meta_classifier=lr)\n",
    " \n",
    "print('3-fold cross validation:\\n')\n",
    " \n",
    "for clf, label in zip([clf1, clf2, clf3, sclf], \n",
    "                      ['KNN', \n",
    "                       'Random Forest', \n",
    "                       'Naive Bayes',\n",
    "                       'StackingClassifier']):\n",
    " \n",
    "    scores = model_selection.cross_val_score(clf, X, y, \n",
    "                                              cv=3, scoring='accuracy')\n",
    "    print(\"Accuracy: %0.2f (+/- %0.2f) [%s]\" \n",
    "          % (scores.mean(), scores.std(), label))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "II. 另一种使用第一层基本分类器产生的类别概率值作为meta-classfier的输入，这种情况下需要将StackingClassifier的参数设置为 use_probas=True。如果将参数设置为 average_probas=True，那么这些基分类器对每一个类别产生的概率值会被平均，否则会拼接。\n",
    "\n",
    " \n",
    "例如有两个基分类器产生的概率输出为：\n",
    "\n",
    "classifier 1: [0.2, 0.5, 0.3]\n",
    "\n",
    "classifier 2: [0.3, 0.4, 0.4]\n",
    "\n",
    "1) average = True : \n",
    "\n",
    "产生的meta-feature 为：[0.25, 0.45, 0.35]\n",
    "\n",
    "2) average = False:\n",
    "\n",
    "产生的meta-feature为：[0.2, 0.5, 0.3, 0.3, 0.4, 0.4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    " \n",
    "iris = datasets.load_iris()\n",
    "X, y = iris.data[:, 1:3], iris.target\n",
    " \n",
    "from sklearn import model_selection\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.naive_bayes import GaussianNB \n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from mlxtend.classifier import StackingClassifier\n",
    "import numpy as np\n",
    " \n",
    "clf1 = KNeighborsClassifier(n_neighbors=1)\n",
    "clf2 = RandomForestClassifier(random_state=1)\n",
    "clf3 = GaussianNB()\n",
    "lr = LogisticRegression()\n",
    "sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],\n",
    "                          use_probas=True,\n",
    "                          average_probas=False,\n",
    "                          meta_classifier=lr)\n",
    " \n",
    "print('3-fold cross validation:\\n')\n",
    " \n",
    "for clf, label in zip([clf1, clf2, clf3, sclf], \n",
    "                      ['KNN', \n",
    "                       'Random Forest', \n",
    "                       'Naive Bayes',\n",
    "                       'StackingClassifier']):\n",
    " \n",
    "    scores = model_selection.cross_val_score(clf, X, y, \n",
    "                                              cv=3, scoring='accuracy')\n",
    "    print(\"Accuracy: %0.2f (+/- %0.2f) [%s]\" \n",
    "          % (scores.mean(), scores.std(), label))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "III. 另外一种方法是对训练基中的特征维度进行操作的，这次不是给每一个基分类器全部的特征，而是给不同的基分类器分不同的特征，即比如基分类器1训练前半部分特征，基分类器2训练后半部分特征（可以通过sklearn 的pipelines 实现）。最终通过StackingClassifier组合起来。\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "ename": "ModuleNotFoundError",
     "evalue": "No module named 'mlxtend'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-3-388bec8bb9f8>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdatasets\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mload_iris\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0mmlxtend\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclassifier\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mStackingClassifier\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      3\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mmlxtend\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfeature_selection\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mColumnSelector\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpipeline\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mmake_pipeline\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlinear_model\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mLogisticRegression\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'mlxtend'"
     ]
    }
   ],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from mlxtend.classifier import StackingClassifier\n",
    "from mlxtend.feature_selection import ColumnSelector\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.linear_model import LogisticRegression\n",
    " \n",
    "iris = load_iris()\n",
    "X = iris.data\n",
    "y = iris.target\n",
    " \n",
    "pipe1 = make_pipeline(ColumnSelector(cols=(0, 2)),\n",
    "                      LogisticRegression())\n",
    "pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),\n",
    "                      LogisticRegression())\n",
    " \n",
    "sclf = StackingClassifier(classifiers=[pipe1, pipe2], \n",
    "                          meta_classifier=LogisticRegression())\n",
    " \n",
    "sclf.fit(X, y)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "StackingClassifier 使用API及参数解析：\n",
    "\n",
    " \n",
    "\n",
    "StackingClassifier(classifiers, meta_classifier, use_probas=False, average_probas=False, verbose=0, use_features_in_secondary=False)\n",
    "\n",
    "**参数：**\n",
    "\n",
    "classifiers : 基分类器，数组形式，[cl1, cl2, cl3]. 每个基分类器的属性被存储在类属性 self.clfs_.\n",
    "meta_classifier : 目标分类器，即将前面分类器合起来的分类器\n",
    "\n",
    "use_probas : bool (default: False) ，如果设置为True， 那么目标分类器的输入就是前面分类输出的类别概率值而不是类别标签\n",
    "\n",
    "average_probas : bool (default: False)，用来设置上一个参数当使用概率值输出的时候是否使用平均值。\n",
    "\n",
    "verbose : int, optional (default=0)。用来控制使用过程中的日志输出，当 verbose = 0时，什么也不输出， verbose = 1，输出回归器的序号和名字。verbose = 2，输出详细的参数信息。verbose > 2, 自动将verbose设置为小于2的，verbose -2.\n",
    "\n",
    "use_features_in_secondary : bool (default: False). 如果设置为True，那么最终的目标分类器就被基分类器产生的数据和最初的数据集同时训练。如果设置为False，最终的分类器只会使用基分类器产生的数据训练。\n",
    "\n",
    "**属性：**\n",
    "\n",
    "clfs_ : 每个基分类器的属性，list, shape 为 [n_classifiers]。\n",
    "meta_clf_ : 最终目标分类器的属性\n",
    "\n",
    "**方法：**\n",
    "\n",
    "fit(X, y)      \n",
    "fit_transform(X, y=None, fit_params)      \n",
    "get_params(deep=True)，如果是使用sklearn的  GridSearch方法，那么返回分类器的各项参数。\n",
    "predict(X)    \n",
    "predict_proba(X)    \n",
    "\n",
    "score(X, y, sample_weight=None)， 对于给定数据集和给定label，返回评价accuracy\n",
    "\n",
    "set_params(params)，设置分类器的参数，params的设置方法和sklearn的格式一样。\n",
    "\n",
    "\n",
    "参考链接： \n",
    "\n",
    "https://zhuanlan.zhihu.com/p/25836678\n",
    "\n",
    "https://blog.csdn.net/willduan1/article/details/73618677"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Blending (非交叉堆叠)\n",
    "Blending的出现是为了解决Stacking在交叉验证阶段出现的数据泄露（stage2的input会包含output的信息，具体参考为什么做stacking ensemble的时候需要固定k-fold？），容易产生过拟合，Blending直接使用不相交的数据集用于不同层的训练，通俗的理解就是不做交叉验证，而是将训练集分成3:7两个部分，70%作为训练集，对30%验证集和测试集进行预测，第二层是对30%验证集的预测结果进行训练，不存在数据泄露的问题。但是存在30%验证集数量较少，容易过拟合的问题，所以在实际融合中，使用Stacking还是Blending是有很多Trick的。\n",
    "下面来一个Blending的例子"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import cross_validation\n",
    "from sklearn.metrics import log_loss, accuracy_score\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import random\n",
    "import md5\n",
    "import json\n",
    "\n",
    "def blend_proba(clf, X_train, y, X_test, nfolds=5, save_preds=\"\",\n",
    "                save_test_only=\"\", seed=300373, save_params=\"\",\n",
    "                clf_name=\"XX\", generalizers_params=[], minimal_loss=0,\n",
    "                return_score=False, minimizer=\"log_loss\"):\n",
    "  print(\"\\nBlending with classifier:\\n\\t{}\".format(clf))\n",
    "  folds = list(cross_validation.StratifiedKFold(y, nfolds,shuffle=True,random_state=seed))\n",
    "  print(X_train.shape)\n",
    "  dataset_blend_train = np.zeros((X_train.shape[0],np.unique(y).shape[0]))\n",
    "\n",
    "  #iterate through train set and train - predict folds\n",
    "  loss = 0\n",
    "  for i, (train_index, test_index) in enumerate( folds ):\n",
    "    print(\"Train Fold {}/{}}\".format(i+1,nfolds))\n",
    "    fold_X_train = X_train[train_index]\n",
    "    fold_y_train = y[train_index]\n",
    "    fold_X_test = X_train[test_index]\n",
    "    fold_y_test = y[test_index]\n",
    "    clf.fit(fold_X_train, fold_y_train)\n",
    "\n",
    "    fold_preds = clf.predict_proba(fold_X_test)\n",
    "    print(\"Logistic loss: {}\".format(log_loss(fold_y_test,fold_preds)))\n",
    "    dataset_blend_train[test_index] = fold_preds\n",
    "    if minimizer == \"log_loss\":\n",
    "      loss += log_loss(fold_y_test,fold_preds)\n",
    "    if minimizer == \"accuracy\":\n",
    "      fold_preds_a = np.argmax(fold_preds, axis=1)\n",
    "      loss += accuracy_score(fold_y_test,fold_preds_a)\n",
    "    #fold_preds = clf.predict(fold_X_test)\n",
    "\n",
    "    #loss += accuracy_score(fold_y_test,fold_preds)\n",
    "\n",
    "    if minimal_loss > 0 and loss > minimal_loss and i == 0:\n",
    "      return False, False\n",
    "    fold_preds = np.argmax(fold_preds, axis=1)\n",
    "    print(\"Accuracy:      {}\".format(accuracy_score(fold_y_test,fold_preds)))\n",
    "  avg_loss = loss / float(i+1)\n",
    "  print(\"\\nAverage:\\t{}\\n\".format(avg_loss))\n",
    "  #predict test set (better to take average on all folds, but this is quicker)\n",
    "  print(\"Test Fold 1/1\")\n",
    "  clf.fit(X_train, y)\n",
    "  dataset_blend_test = clf.predict_proba(X_test)\n",
    "\n",
    "  if clf_name == \"XX\":\n",
    "    clf_name = str(clf)[1:3]\n",
    "\n",
    "  if len(save_preds)>0:\n",
    "    id = md5.new(\"{}\"{}tr(clf.get_params())).hexdigest()\n",
    "    print(\"storing meta predictions at: {}\"{}ave_preds)\n",
    "    np.save(\"{}_{}_{}_train.npy\".format((save_preds,clf_name,avg_loss,id),dataset_blend_train))\n",
    "    np.save(\"{}_{}_{}_test.npy\".format((save_preds,clf_name,avg_loss,id),dataset_blend_test))\n",
    "\n",
    "  if len(save_test_only)>0:\n",
    "    id = md5.new(\"{}\"{}tr(clf.get_params())).hexdigest()\n",
    "    print(\"storing meta predictions at: {}\"{}ave_test_only)\n",
    "\n",
    "    dataset_blend_test = clf.predict(X_test)\n",
    "    np.savetxt(\"{}_{}_{}_test.txt\".format((save_test_only,clf_name,avg_loss,id),dataset_blend_test))\n",
    "    d = {}\n",
    "    d[\"stacker\"] = clf.get_params()\n",
    "    d[\"generalizers\"] = generalizers_params\n",
    "    with open(\"{}_{}_{}_params.json\".format((save_test_only,clf_name,avg_loss, id), 'wb')) as f:\n",
    "      json.dump(d, f)\n",
    "\n",
    "  if len(save_params)>0:\n",
    "    id = md5.new(\"{}\"{}tr(clf.get_params())).hexdigest()\n",
    "    d = {}\n",
    "    d[\"name\"] = clf_name\n",
    "    d[\"params\"] = { k:(v.get_params() if \"\\n\" in str(v) or \"<\" in str(v) else v) for k,v in clf.get_params().items()}\n",
    "    d[\"generalizers\"] = generalizers_params\n",
    "    with open(\"{}_{}_{}_params.json\".format((save_params,clf_name,avg_loss, id), 'wb')) as f:\n",
    "      json.dump(d, f)\n",
    "\n",
    "  if np.unique(y).shape[0] == 2: # when binary classification only return positive class proba\n",
    "    if return_score:\n",
    "      return dataset_blend_train[:,1], dataset_blend_test[:,1], avg_loss\n",
    "    else:\n",
    "      return dataset_blend_train[:,1], dataset_blend_test[:,1]\n",
    "  else:\n",
    "    if return_score:\n",
    "      return dataset_blend_train, dataset_blend_test, avg_loss\n",
    "    else:\n",
    "      return dataset_blend_train, dataset_blend_test"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
